Methods and ideas to be covered in this notebook:
import pandas as pd
import numpy as np
import helper
import project_helper_zh
import project_tests
%matplotlib inline
df_original = pd.read_csv('eod-quotemedia.csv', parse_dates=['date'], index_col=False)
# Add TB sector to the market
df = df_original
df = pd.concat([df] + project_helper_zh.generate_tb_sector(df[df['ticker'] == 'AAPL']['date']), ignore_index=True)
close = df.reset_index().pivot(index='date', columns='ticker', values='adj_close')
high = df.reset_index().pivot(index='date', columns='ticker', values='adj_high')
low = df.reset_index().pivot(index='date', columns='ticker', values='adj_low')
print('Loaded Data')
To see what these two-dimensional matrices look like, let's look at the close price matrix.
close
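As a rough illustration of how pivot turns long-format quotes into a date-by-ticker matrix, here is a minimal sketch on made-up data. The tickers 'AAA' and 'BBB', the prices, and every toy_ name are hypothetical and are not taken from eod-quotemedia.csv.

```python
import pandas as pd

# Hypothetical long-format quotes: one row per (date, ticker) pair.
toy_df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02']),
    'ticker': ['AAA', 'BBB', 'AAA', 'BBB'],
    'adj_close': [10.0, 20.0, 11.0, 19.0],
})

# Rows become dates, columns become tickers, cells hold the adjusted close.
toy_close = toy_df.pivot(index='date', columns='ticker', values='adj_close')
print(toy_close)
```

Each column of the result is the price series of one ticker, which is why a single stock can be selected with close[apple_ticker].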
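Let's use the close price matrix to see what a single stock looks like. For this example and the later examples in this project, we'll use Apple's stock (AAPL). Plotting every stock at once would show too much information.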
apple_ticker = 'AAPL'
project_helper_zh.plot_stock(close[apple_ticker], '{} Stock'.format(apple_ticker))
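In this project, you will write and evaluate a "breakout" signal. It is important to understand where these steps fit in the alpha research workflow. Trading signals have a low signal-to-noise ratio, so it is easy to overfit to noise. For that reason, jumping straight into signal coding is not advisable. To avoid overfitting, start from a general hypothesis; that is, before touching any data, you should be able to answer the following question:
What feature of markets or investor behavior would lead to a persistent anomaly that my signal can exploit?
Ideally, you should be able to test the assumptions behind the hypothesis before you code and evaluate the signal itself. The workflow looks like this: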

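In this project, we assume that the first three steps (observe and research, form a hypothesis, validate the hypothesis) have already been done. For this project, you will use the following hypothesis:
Let's start coding with this hypothesis in mind.
We'll build the breakout strategy from the high and low prices. In this section, implement get_high_lows_lookback to get the maximum high price and the minimum low price over a window of days. The variable lookback_days contains the number of past days to look at. Make sure the window does not include the current day.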
def get_high_lows_lookback(high, low, lookback_days):
    """
    Get the highs and lows in a lookback window.

    Parameters
    ----------
    high : DataFrame
        High price for each ticker and date
    low : DataFrame
        Low price for each ticker and date
    lookback_days : int
        The number of days to look back

    Returns
    -------
    lookback_high : DataFrame
        Lookback high price for each ticker and date
    lookback_low : DataFrame
        Lookback low price for each ticker and date
    """
    # .shift(1) makes sure it does not include the current day
    return high.shift(1).rolling(lookback_days).max(), low.shift(1).rolling(lookback_days).min()
project_tests.test_get_high_lows_lookback(get_high_lows_lookback)
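To make the "do not include the current day" requirement concrete, the following sketch applies the same shift-then-roll pattern to a single made-up column. The ticker 'AAA', the prices, and the 2-day window are assumptions chosen only to keep the numbers easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical daily highs for a single made-up ticker.
toy_high = pd.DataFrame({'AAA': [10.0, 12.0, 11.0, 15.0, 13.0]})

# Shift by one day first so each window ends on the previous day,
# then take the rolling maximum over a 2-day lookback.
toy_lookback_high = toy_high.shift(1).rolling(2).max()
print(toy_lookback_high['AAA'].tolist())  # [nan, nan, 12.0, 12.0, 15.0]
```

On the fourth day the high is 15.0 but the lookback high is still 12.0, which shows that the current day is excluded. The first two entries are NaN because a full 2-day window of past data is not yet available.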
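Let's use get_high_lows_lookback to get the highs and lows over the past 50 days and compare them with the corresponding stock. As before, we'll use Apple's stock as the example.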
lookback_days = 50
lookback_high, lookback_low = get_high_lows_lookback(high, low, lookback_days)
project_helper_zh.plot_high_low(
close[apple_ticker],
lookback_high[apple_ticker],
lookback_low[apple_ticker],
'High and Low of {} Stock'.format(apple_ticker))
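Using the generated high and low price indicators, create long and short signals with a breakout strategy. Implement get_long_short to generate the following signals:
| Signal | Condition |
|---|---|
| -1 | Low > Close Price |
| 1 | High < Close Price |
| 0 | Otherwise |
In this table, Close Price is the close parameter. Low and High are the lookback_low and lookback_high values generated by get_high_lows_lookback.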
def get_long_short(close, lookback_high, lookback_low):
    """
    Generate the signals long, short, and do nothing.

    Parameters
    ----------
    close : DataFrame
        Close price for each ticker and date
    lookback_high : DataFrame
        Lookback high price for each ticker and date
    lookback_low : DataFrame
        Lookback low price for each ticker and date

    Returns
    -------
    long_short : DataFrame
        The long, short, and do nothing signals for each ticker and date
    """
    # 1 where the close breaks above the lookback high, else 0
    long = (close > lookback_high).astype('int')
    # -1 where the close breaks below the lookback low, else 0
    short = (close < lookback_low).astype('int') * -1
    return long + short
project_tests.test_get_long_short(get_long_short)
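The signal table can be checked by hand on a few made-up prices. This is only a sketch: the ticker 'AAA' and the constant lookback levels of 12 and 10 are hypothetical values, not output of get_high_lows_lookback.

```python
import pandas as pd

# Hypothetical close prices and constant lookback levels for one ticker.
toy_close = pd.DataFrame({'AAA': [10.0, 13.0, 9.0, 11.0]})
toy_lookback_high = pd.DataFrame({'AAA': [12.0, 12.0, 12.0, 12.0]})
toy_lookback_low = pd.DataFrame({'AAA': [10.0, 10.0, 10.0, 10.0]})

# Long (1) above the lookback high, short (-1) below the lookback low.
toy_long = (toy_close > toy_lookback_high).astype(int)
toy_short = (toy_close < toy_lookback_low).astype(int) * -1
toy_signal = toy_long + toy_short
print(toy_signal['AAA'].tolist())  # [0, 1, -1, 0]
```

A close of exactly 10.0 does not trigger a short because the comparison is strict, so the first day stays at 0.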
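Let's compare the signals you created against the close price. This chart will show a lot of signals. Too many, in fact. We'll talk about filtering out the redundant signals in the next problem.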
signal = get_long_short(close, lookback_high, lookback_low)
project_helper_zh.plot_signal(
close[apple_ticker],
signal[apple_ticker],
'Long and Short of {} Stock'.format(apple_ticker))
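That was a lot of repeated signals! If we're already short a stock, another short signal doesn't add much. The same goes for additional long signals when the last signal was long.
Implement filter_signals to filter out repeated long or short signals within lookahead_days. If the previous signal was the same, change the signal to 0 (the do-nothing signal). For example, say you have a single stock time series that is
[1, 0, 1, 0, 1, 0, -1, -1]
Running filter_signals with a lookahead of 3 days should turn those signals into
[1, 0, 0, 0, 1, 0, -1, 0]
To help you implement the function, we have provided the clear_signals function. It removes every signal that falls within the window after the last kept signal. For example, with a window size of 3, clear_signals turns this series of long signals
[0, 1, 0, 0, 1, 1, 0, 1, 0]
into
[0, 1, 0, 0, 0, 1, 0, 0, 0]
clear_signals only accepts a series containing a single type of signal, where 1 is the signal and 0 is no signal. Long and short signals cannot be mixed in one call. Use this function to implement filter_signals.
When implementing filter_signals, we don't recommend looking for a vectorized solution. Instead, use iterrows over each column.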
def clear_signals(signals, window_size):
    """
    Clear out signals in a Series of just long or short signals.

    Remove the number of signals down to 1 within the window size time period.

    Parameters
    ----------
    signals : Pandas Series
        The long, short, or do nothing signals
    window_size : int
        The number of days to have a single signal

    Returns
    -------
    signals : Pandas Series
        Signals with the signals removed from the window size
    """
    # Start with buffer of window size
    # This handles the edge case of calculating past_signal in the beginning
    clean_signals = [0]*window_size

    for signal_i, current_signal in enumerate(signals):
        # Check if there was a signal in the past window_size of days
        has_past_signal = bool(sum(clean_signals[signal_i:signal_i+window_size]))
        # Use the current signal if there's no past signal, else 0/False
        clean_signals.append(not has_past_signal and current_signal)

    # Remove buffer
    clean_signals = clean_signals[window_size:]

    # Return the signals as a Series of ints
    # (the builtin int replaces np.int, which was removed in NumPy 1.24)
    return pd.Series(np.array(clean_signals).astype(int), signals.index)
def filter_signals(signal, lookahead_days):
    """
    Filter out signals in a DataFrame.

    Parameters
    ----------
    signal : DataFrame
        The long, short, and do nothing signals for each ticker and date
    lookahead_days : int
        The number of days to look ahead

    Returns
    -------
    filtered_signal : DataFrame
        The filtered long, short, and do nothing signals for each ticker and date
    """
    f_signal = signal.copy()
    # .items() yields (column name, column Series) pairs;
    # it replaces .iteritems(), which was removed in pandas 2.0
    for ticker, column in signal.items():
        # Split the column into long-only and short-only series
        long = column.copy()
        short = column.copy()
        long[long < 0] = 0
        short[short > 0] = 0
        # Clear repeats in each series separately, then recombine
        f_signal[ticker] = clear_signals(long, lookahead_days) + clear_signals(short, lookahead_days)
    return f_signal
project_tests.test_filter_signals(filter_signals)
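The two worked examples from the problem statement can be reproduced with plain Python lists. The sketch below is not the notebook's implementation: clear_repeats and filter_repeats are hypothetical helper names, and the code assumes that "within the window" means "a signal was kept in the previous window_size entries".

```python
def clear_repeats(signals, window_size):
    """Keep a signal only if no signal was kept in the previous window_size entries."""
    cleaned = []
    for current in signals:
        recent = cleaned[-window_size:] if window_size > 0 else []
        cleaned.append(0 if any(recent) else current)
    return cleaned


def filter_repeats(signals, lookahead_days):
    """Clear long and short signals separately, then recombine them."""
    longs = [max(s, 0) for s in signals]
    shorts = [min(s, 0) for s in signals]
    cleaned_longs = clear_repeats(longs, lookahead_days)
    cleaned_shorts = clear_repeats(shorts, lookahead_days)
    return [l + s for l, s in zip(cleaned_longs, cleaned_shorts)]


print(clear_repeats([0, 1, 0, 0, 1, 1, 0, 1, 0], 3))   # [0, 1, 0, 0, 0, 1, 0, 0, 0]
print(filter_repeats([1, 0, 1, 0, 1, 0, -1, -1], 3))   # [1, 0, 0, 0, 1, 0, -1, 0]
```

Splitting by sign before clearing is what lets a short signal fire right after a long one: the long at position 4 does not suppress the short at position 6, because each direction only looks at its own history.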
Let's view the same chart as before, but with the redundant signals removed.
signal_5 = filter_signals(signal, 5)
signal_10 = filter_signals(signal, 10)
signal_20 = filter_signals(signal, 20)
for signal_data, signal_days in [(signal_5, 5), (signal_10, 10), (signal_20, 20)]:
    project_helper_zh.plot_signal(
        close[apple_ticker],
        signal_data[apple_ticker],
        'Long and Short of {} Stock with {} day signal window'.format(apple_ticker, signal_days))
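With the trading signals done, we can start working out how many days to stay long or short a stock. In this problem, implement get_lookahead_prices to get the close price a number of days ahead in time. You can get the number of days from the variable lookahead_days. We'll use the lookahead prices to calculate future returns in another problem.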
def get_lookahead_prices(close, lookahead_days):
    """
    Get the lookahead prices for `lookahead_days` number of days.

    Parameters
    ----------
    close : DataFrame
        Close price for each ticker and date
    lookahead_days : int
        The number of days to look ahead

    Returns
    -------
    lookahead_prices : DataFrame
        The lookahead prices for each ticker and date
    """
    # A negative shift moves future rows up to the current date
    return close.shift(-lookahead_days)
project_tests.test_get_lookahead_prices(get_lookahead_prices)
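A negative shift is easy to misread, so here is a small sketch on made-up prices. The ticker 'AAA', the prices, and the 2-day lookahead are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical close prices for one made-up ticker.
toy_close = pd.DataFrame({'AAA': [1.0, 2.0, 3.0, 4.0, 5.0]})

# shift(-2) aligns each row with the price two rows later.
toy_lookahead = toy_close.shift(-2)
print(toy_lookahead['AAA'].tolist())  # [3.0, 4.0, 5.0, nan, nan]
```

The last two rows are NaN because no price exists two days past the end of the data, so the most recent dates never get a lookahead price.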
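Let's use the get_lookahead_prices function to generate lookahead close prices for 5, 10, and 20 days.
We'll plot a few months of Apple's stock rather than years, so that the difference between the 5-, 10-, and 20-day lookaheads is visible. On a zoomed-out chart the series would be squeezed together.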
lookahead_5 = get_lookahead_prices(close, 5)
lookahead_10 = get_lookahead_prices(close, 10)
lookahead_20 = get_lookahead_prices(close, 20)
project_helper_zh.plot_lookahead_prices(
close[apple_ticker].iloc[150:250],
[
(lookahead_5[apple_ticker].iloc[150:250], 5),
(lookahead_10[apple_ticker].iloc[150:250], 10),
(lookahead_20[apple_ticker].iloc[150:250], 20)],
'5, 10, and 20 day Lookahead Prices for Slice of {} Stock'.format(apple_ticker))
Implement get_return_lookahead to generate the log price return between the close price and the lookahead price.
def get_return_lookahead(close, lookahead_prices):
    """
    Calculate the log returns from the lookahead days to the signal day.

    Parameters
    ----------
    close : DataFrame
        Close price for each ticker and date
    lookahead_prices : DataFrame
        The lookahead prices for each ticker and date

    Returns
    -------
    lookahead_returns : DataFrame
        The lookahead log returns for each ticker and date
    """
    # log(P_future) - log(P_now) equals log(P_future / P_now)
    return np.log(lookahead_prices) - np.log(close)
project_tests.test_get_return_lookahead(get_return_lookahead)
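As a quick numeric check of the log return formula, the sketch below uses three made-up prices (100, 110, 121) and shows that log returns over consecutive periods add up to the log return over the whole period. All variable names here are hypothetical.

```python
import numpy as np

# Hypothetical prices at three points in time.
price_today = 100.0
price_mid = 110.0
price_later = 121.0

# Log return of each leg and of the whole period.
first_leg = np.log(price_mid) - np.log(price_today)
second_leg = np.log(price_later) - np.log(price_mid)
whole_period = np.log(price_later) - np.log(price_today)

print(first_leg)                  # about 0.0953, close to the 10% simple return
print(first_leg + second_leg)     # matches whole_period up to rounding
```

For small moves the log return is close to the simple return, and the additivity makes it convenient to compare lookahead windows of different lengths.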
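Using the same lookahead prices and slice of Apple stock data as in the previous problem, let's view the lookahead returns.
To show price returns on the stock chart, we'll add a second y-axis. On this chart the stock price axis is on the left, as in the earlier charts, and the price return axis is on the right.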
price_return_5 = get_return_lookahead(close, lookahead_5)
price_return_10 = get_return_lookahead(close, lookahead_10)
price_return_20 = get_return_lookahead(close, lookahead_20)
project_helper_zh.plot_price_returns(
close[apple_ticker].iloc[150:250],
[
(price_return_5[apple_ticker].iloc[150:250], 5),
(price_return_10[apple_ticker].iloc[150:250], 10),
(price_return_20[apple_ticker].iloc[150:250], 20)],
'5, 10, and 20 day Lookahead Returns for Slice {} Stock'.format(apple_ticker))
Using the price returns, generate the signal returns.
def get_signal_return(signal, lookahead_returns):
    """
    Compute the signal returns.

    Parameters
    ----------
    signal : DataFrame
        The long, short, and do nothing signals for each ticker and date
    lookahead_returns : DataFrame
        The lookahead log returns for each ticker and date

    Returns
    -------
    signal_return : DataFrame
        Signal returns for each ticker and date
    """
    # Element-wise product: a short (-1) flips the sign of the return
    return signal * lookahead_returns
project_tests.test_get_signal_return(get_signal_return)
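The product of signal and return is worth seeing on a few numbers, since the sign handling is the whole point. The ticker 'AAA' and all values below are made up for illustration.

```python
import pandas as pd

# Hypothetical signals and lookahead log returns for one ticker.
toy_signal = pd.DataFrame({'AAA': [1, -1, 0, -1]})
toy_returns = pd.DataFrame({'AAA': [0.02, -0.03, 0.05, 0.01]})

# Long keeps the return, short flips its sign, 0 discards it.
toy_signal_return = toy_signal * toy_returns
print(toy_signal_return['AAA'].tolist())  # [0.02, 0.03, 0.0, -0.01]
```

The second row is a short position on a falling price, so it earns a positive signal return. The last row is a short position on a rising price, so it loses.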
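Let's continue using the previous lookahead prices to view the signal returns. As before, the signal return axis is on the right side of the chart.
title_string = '{} day Lookahead Signal Returns for {} Stock'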
signal_return_5 = get_signal_return(signal_5, price_return_5)
signal_return_10 = get_signal_return(signal_10, price_return_10)
signal_return_20 = get_signal_return(signal_20, price_return_20)
project_helper_zh.plot_signal_returns(
close[apple_ticker],
[
(signal_return_5[apple_ticker], signal_5[apple_ticker], 5),
(signal_return_10[apple_ticker], signal_10[apple_ticker], 10),
(signal_return_20[apple_ticker], signal_20[apple_ticker], 20)],
[title_string.format(5, apple_ticker), title_string.format(10, apple_ticker), title_string.format(20, apple_ticker)])
project_helper_zh.plot_signal_histograms(
[signal_return_5, signal_return_10, signal_return_20],
'Signal Return',
('5 Days', '10 Days', '20 Days'))
project_helper_zh.plot_signal_to_normal_histograms(
[signal_return_5, signal_return_10, signal_return_20],
'Signal Return',
('5 Days', '10 Days', '20 Days'))
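Having spotted outliers in the histograms, we need to find the stocks that cause these outlying returns. We'll use the Kolmogorov-Smirnov test (KS test for short). We'll apply this test to the signal returns of each ticker where a long or short signal exists.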
# Filter out returns that don't have a long or short signal.
long_short_signal_returns_5 = signal_return_5[signal_5 != 0].stack()
long_short_signal_returns_10 = signal_return_10[signal_10 != 0].stack()
long_short_signal_returns_20 = signal_return_20[signal_20 != 0].stack()
# Get just ticker and signal return
long_short_signal_returns_5 = long_short_signal_returns_5.reset_index().iloc[:, [1,2]]
long_short_signal_returns_5.columns = ['ticker', 'signal_return']
long_short_signal_returns_10 = long_short_signal_returns_10.reset_index().iloc[:, [1,2]]
long_short_signal_returns_10.columns = ['ticker', 'signal_return']
long_short_signal_returns_20 = long_short_signal_returns_20.reset_index().iloc[:, [1,2]]
long_short_signal_returns_20.columns = ['ticker', 'signal_return']
# View some of the data
long_short_signal_returns_5.head(10)
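The code above gives you the data to use in the KS test.
Now implement the function calculate_kstest to use the Kolmogorov-Smirnov test (KS test) to compare the normal distribution with each ticker's signal return distribution. Run the KS test against a normal distribution for each ticker's signal returns. Use scipy.stats.kstest to perform the KS test. When calculating the standard deviation of the signal returns, set the delta degrees of freedom to 0.
For this function, we don't recommend looking for a vectorized solution. Instead, iterate over the groupby function.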
from scipy.stats import kstest
def calculate_kstest(long_short_signal_returns):
    """
    Calculate the KS-Test against the signal returns with a long or short signal.

    Parameters
    ----------
    long_short_signal_returns : DataFrame
        The signal returns which have a signal.
        This DataFrame contains two columns, "ticker" and "signal_return"

    Returns
    -------
    ks_values : Pandas Series
        KS statistic for all the tickers
    p_values : Pandas Series
        P value for all the tickers
    """
    ks_dict = {}
    p_dict = {}
    # Parameters of the reference normal distribution, taken over all signal returns.
    # Selecting the numeric column avoids averaging over the string "ticker" column.
    mean = long_short_signal_returns['signal_return'].mean()
    std = long_short_signal_returns['signal_return'].std(ddof=0)
    # groupby yields (ticker, group DataFrame) pairs
    for ticker, group in long_short_signal_returns.groupby('ticker'):
        value = group['signal_return'].values
        # Two-sided KS test (the default); under the null hypothesis the
        # ticker's returns follow the normal distribution N(mean, std)
        ks, p = kstest(value, 'norm', args=(mean, std))
        ks_dict[ticker] = ks
        p_dict[ticker] = p
    return pd.Series(ks_dict), pd.Series(p_dict)
project_tests.test_calculate_kstest(calculate_kstest)
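To get a feel for what the KS statistic and p-value look like, the sketch below runs scipy.stats.kstest on two synthetic samples. The data is invented: one sample is drawn from a standard normal with a fixed seed, the other is a constant far from the mean. Neither comes from the notebook's returns.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)

# A sample that really is standard normal, and one that clearly is not.
normal_like = rng.normal(loc=0.0, scale=1.0, size=500)
far_off = np.full(50, 10.0)

# args=(mean, std) sets the parameters of the reference normal distribution.
ks_close, p_close = kstest(normal_like, 'norm', args=(0.0, 1.0))
ks_far, p_far = kstest(far_off, 'norm', args=(0.0, 1.0))

print('normal-like sample: ks={:.3f}, p={:.3f}'.format(ks_close, p_close))
print('far-off sample:     ks={:.3f}, p={:.3g}'.format(ks_far, p_far))
```

The KS statistic is the largest gap between the sample's empirical distribution function and the reference distribution function, so it ranges from 0 to 1. The far-off sample yields a statistic near 1 with a tiny p-value, which is the pattern the outlier search looks for.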
Using the signal returns we created above, let's calculate the KS and p-values.
ks_values_5, p_values_5 = calculate_kstest(long_short_signal_returns_5)
ks_values_10, p_values_10 = calculate_kstest(long_short_signal_returns_10)
ks_values_20, p_values_20 = calculate_kstest(long_short_signal_returns_20)
print('ks_values_5')
print(ks_values_5.head(10))
print('p_values_5')
print(p_values_5.head(10))
With the KS and p-values calculated, let's see which tickers are outliers. Implement the find_outliers function to find the tickers whose p-value is below pvalue_threshold and whose KS value is above ks_threshold.
def find_outliers(ks_values, p_values, ks_threshold, pvalue_threshold=0.05):
    """
    Find outlying symbols using KS values and P-values

    Parameters
    ----------
    ks_values : Pandas Series
        KS statistic for all the tickers
    p_values : Pandas Series
        P value for all the tickers
    ks_threshold : float
        The threshold for the KS statistic
    pvalue_threshold : float
        The threshold for the p-value

    Returns
    -------
    outliers : set of str
        Symbols that are outliers
    """
    # Tickers with a KS statistic above the threshold
    ks = set(ks_values[ks_values > ks_threshold].index)
    # Tickers with a p-value below the threshold
    p = set(p_values[p_values < pvalue_threshold].index)
    # A ticker is an outlier only if it meets both conditions
    return ks & p
project_tests.test_find_outliers(find_outliers)
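The set intersection used to pick outliers can be traced on a tiny made-up example. The tickers 'AAA', 'BBB', 'CCC' and their KS and p-values are hypothetical, chosen so that each ticker fails or passes a different condition.

```python
import pandas as pd

# Hypothetical KS statistics and p-values for three made-up tickers.
toy_ks = pd.Series({'AAA': 0.95, 'BBB': 0.10, 'CCC': 0.90})
toy_p = pd.Series({'AAA': 0.001, 'BBB': 0.002, 'CCC': 0.40})

# Boolean masks select the index labels that meet each condition.
high_ks = set(toy_ks[toy_ks > 0.8].index)   # AAA and CCC
low_p = set(toy_p[toy_p < 0.05].index)      # AAA and BBB

# Only tickers that meet both conditions count as outliers.
toy_outliers = high_ks & low_p
print(toy_outliers)  # {'AAA'}
```

Requiring both conditions is a stricter rule than either one alone: 'BBB' has a small p-value but a low KS statistic, and 'CCC' has a high KS statistic but a large p-value, so neither is flagged.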
Using the find_outliers function you implemented, let's see which tickers we found.
ks_threshold = 0.8
outliers_5 = find_outliers(ks_values_5, p_values_5, ks_threshold)
outliers_10 = find_outliers(ks_values_10, p_values_10, ks_threshold)
outliers_20 = find_outliers(ks_values_20, p_values_20, ks_threshold)
outlier_tickers = outliers_5.union(outliers_10).union(outliers_20)
print('{} Outliers Found:\n{}'.format(len(outlier_tickers), ', '.join(list(outlier_tickers))))
Compare the 5-, 10-, and 20-day signal returns without outliers to a normal distribution, and see how the p-values change once the outliers are removed.
good_tickers = list(set(close.columns) - outlier_tickers)
project_helper_zh.plot_signal_to_normal_histograms(
[signal_return_5[good_tickers], signal_return_10[good_tickers], signal_return_20[good_tickers]],
'Signal Return Without Outliers',
('5 Days', '10 Days', '20 Days'))
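That's more like it. The returns are closer to a normal distribution. You have finished the research phase of the breakout strategy and can now submit your project.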